Metadata Management
Impact and influence of modern AI in metadata management
Yang, Wenli, Fu, Rui, Amin, Muhammad Bilal, Kang, Byeong
Metadata management plays a critical role in data governance, resource discovery, and decision-making in the data-driven era. While traditional metadata approaches have primarily focused on organization, classification, and resource reuse, the integration of modern artificial intelligence (AI) technologies has significantly transformed these processes. This paper investigates both traditional and AI-driven metadata approaches by examining open-source solutions, commercial tools, and research initiatives. A comparative analysis of traditional and AI-driven metadata management methods is provided, highlighting existing challenges and their impact on next-generation datasets. The paper also presents an innovative AI-assisted metadata management framework designed to address these challenges. This framework leverages advanced AI technologies to automate metadata generation, enhance governance, and improve the accessibility and usability of modern datasets. Finally, the paper outlines future directions for research and development, proposing opportunities to further advance metadata management in the context of AI-driven innovation and complex datasets.
Linking the Dynamic PicoProbe Analytical Electron-Optical Beam Line / Microscope to Supercomputers
Brace, Alexander, Vescovi, Rafael, Chard, Ryan, Saint, Nickolaus D., Ramanathan, Arvind, Zaluzec, Nestor J., Foster, Ian
The Dynamic PicoProbe at Argonne National Laboratory is undergoing upgrades that will enable it to produce up to 100s of GB of data per day. While this data is highly important for both fundamental science and industrial applications, there is currently limited on-site infrastructure to handle these high-volume data streams. We address this problem by providing a software architecture capable of supporting large-scale data transfers to the neighboring supercomputers at the Argonne Leadership Computing Facility. To prepare for future scientific workflows, we implement two instructive use cases for hyperspectral and spatiotemporal datasets, which include: (i) off-site data transfer, (ii) machine learning/artificial intelligence and traditional data analysis approaches, and (iii) automatic metadata extraction and cataloging of experimental results. This infrastructure supports expected workloads and also gives domain scientists the ability to reinterrogate data from past experiments to yield additional scientific value and derive new insights.
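The automatic metadata extraction and cataloging step described in this abstract can be illustrated with a minimal sketch. The file names, fields, and JSON-lines catalog format below are illustrative assumptions, not the PicoProbe pipeline's actual implementation:

```python
import hashlib
import json
import time
from pathlib import Path

def extract_metadata(path: Path) -> dict:
    """Derive minimal catalog metadata for one raw data file."""
    data = path.read_bytes()
    return {
        "file": path.name,
        "bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

def catalog(paths, catalog_path: Path) -> int:
    """Append one JSON record per file to a line-delimited catalog."""
    with catalog_path.open("a") as out:
        for p in paths:
            out.write(json.dumps(extract_metadata(p)) + "\n")
    return len(paths)
```

Keeping a checksum alongside size and ingest time is what makes later reinterrogation of past experiments trustworthy: a record can be matched unambiguously to the bytes it describes.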
MetaEnhance: Metadata Quality Improvement for Electronic Theses and Dissertations of University Libraries
Choudhury, Muntabir Hasan, Salsabil, Lamia, Jayanetti, Himarsha R., Wu, Jian, Ingram, William A., Fox, Edward A.
Metadata quality is crucial for digital objects to be discovered through digital library interfaces. However, due to various reasons, the metadata of digital objects often exhibits incomplete, inconsistent, and incorrect values. We investigate methods to automatically detect, correct, and canonicalize scholarly metadata, using seven key fields of electronic theses and dissertations (ETDs) as a case study. We propose MetaEnhance, a framework that utilizes state-of-the-art artificial intelligence methods to improve the quality of these fields. To evaluate MetaEnhance, we compiled a metadata quality evaluation benchmark containing 500 ETDs, by combining subsets sampled using multiple criteria. We tested MetaEnhance on this benchmark and found that the proposed methods achieved near-perfect F1-scores in detecting errors, and F1-scores ranging from 0.85 to 1.00 in correcting errors for five of the seven fields.
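The detect-then-canonicalize pattern described above can be sketched for a single field. The degree-name surface forms and canonical mappings below are hypothetical examples; MetaEnhance itself uses AI methods rather than a fixed lookup table:

```python
import re

# Hypothetical canonical forms for an ETD "degree" field; the real
# MetaEnhance framework learns corrections rather than hard-coding them.
CANONICAL_DEGREES = {
    "phd": "Ph.D.",
    "doctor of philosophy": "Ph.D.",
    "ms": "M.S.",
    "master of science": "M.S.",
}

def _normalize(value: str) -> str:
    """Lowercase and strip punctuation so surface variants collide."""
    return re.sub(r"[^a-z ]", "", value.lower()).strip()

def detect_error(value: str) -> bool:
    """Flag values that match no known surface form of the field."""
    return _normalize(value) not in CANONICAL_DEGREES

def canonicalize(value: str) -> str:
    """Map a noisy field value to its canonical form, if one is known."""
    return CANONICAL_DEGREES.get(_normalize(value), value)
```

For example, `canonicalize("Doctor of Philosophy")` and `canonicalize("PhD")` both collapse to `"Ph.D."`, while an out-of-vocabulary value is flagged by `detect_error` and passed through unchanged.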
Toward a Flexible Metadata Pipeline for Fish Specimen Images
Jebbia, Dom, Wang, Xiaojun, Bakis, Yasin, Bart, Henry L. Jr., Greenberg, Jane
Flexible metadata pipelines are crucial for supporting the FAIR data principles. Despite this need, researchers seldom report their approaches for identifying metadata standards and protocols that support optimal flexibility. This paper reports on an initiative targeting the development of a flexible metadata pipeline for a collection containing over 300,000 digital fish specimen images, harvested from multiple data repositories and fish collections. The images and their associated metadata are being used for AI-related scientific research involving automated species identification, segmentation, and trait extraction. The paper provides contextual background, followed by the presentation of a four-phased approach involving: 1. Assessment of the Problem, 2. Investigation of Solutions, 3. Implementation, and 4. Refinement. The work is part of the NSF Harnessing the Data Revolution, Biology Guided Neural Networks (NSF/HDR-BGNN) project and the HDR Imageomics Institute. An RDF graph prototype pipeline is presented, followed by a discussion of research implications and a conclusion summarizing the results.
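The RDF graph approach mentioned above can be sketched as a tiny triple store linking an image to its specimen and source repository. Every URI, predicate name, and specimen identifier here is an invented placeholder, not the project's actual schema:

```python
EX = "http://example.org/fish/"  # hypothetical namespace, not the project's

# Illustrative (subject, predicate, object) triples for one specimen image.
triples = {
    (EX + "image/0001", EX + "depicts", EX + "specimen/INHS-12345"),
    (EX + "specimen/INHS-12345", EX + "scientificName", "Notropis atherinoides"),
    (EX + "image/0001", EX + "harvestedFrom", EX + "repo/example"),
}

def objects(subject: str, predicate: str) -> list:
    """Look up all objects asserted for a (subject, predicate) pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def describe(subject: str) -> dict:
    """Collect every predicate/object asserted about one subject."""
    return {p: o for s, p, o in triples if s == subject}
```

The flexibility the paper argues for shows up in this shape: adding a new repository or trait-extraction result means asserting new triples, with no change to any fixed record layout.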
The Variable Quality of Metadata About Biological Samples Used in Biomedical Experiments
Gonçalves, Rafael S., Musen, Mark A.
We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample, a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples, a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.
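The kind of requirement checking this study performed can be sketched as per-field validation rules. The field names and rules below are hypothetical simplifications; the actual BioSample/BioSamples requirements are far richer:

```python
# Hypothetical per-field requirements for a sample metadata record:
# a non-empty organism name, a numeric age, and a binary tumor flag.
RULES = {
    "organism": lambda v: isinstance(v, str) and v.strip() != "",
    "age": lambda v: str(v).replace(".", "", 1).isdigit(),
    "is_tumor": lambda v: str(v).lower() in {"yes", "no", "true", "false"},
}

def invalid_fields(record: dict) -> list:
    """Return the names of fields whose values violate their stated requirement."""
    return [f for f, ok in RULES.items() if f in record and not ok(record[f])]
```

A record with `age="not applicable"` or `is_tumor="N/A"` fails these checks, mirroring the paper's finding that even simple binary or numeric fields are often populated with values of the wrong type.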
CiteSeerX: AI in a Digital Library Search Engine
Wu, Jian (Pennsylvania State University) | Williams, Kyle Mark (Pennsylvania State University) | Chen, Hung-Hsuan (Industrial Technology Research Institute) | Khabsa, Madian (Pennsylvania State University) | Caragea, Cornelia (University of North Texas) | Tuarob, Suppawong (Pennsylvania State University) | Ororbia, Alexander G. (Pennsylvania State University) | Jordan, Douglas (Pennsylvania State University) | Mitra, Prasenjit (Pennsylvania State University) | Giles, C. Lee (Pennsylvania State University)
Since then, the project has been directed by C. Lee Giles. This is different from arXiv, Harvard ADS, and PubMed, where papers are submitted by authors or pushed by publishers. Unlike Google Scholar and Microsoft Academic Search, where a significant portion of documents have only metadata (such as titles, authors, and abstracts) available, CiteSeerX users have access to full text. CiteSeerX keeps its own repository, which serves cached versions of papers even if their previous links are not alive any more. In addition to paper downloads, CiteSeerX provides automatically extracted metadata and citation context; this document metadata download service is not available from Google Scholar and only recently became available from Microsoft Academic Search. The system was migrated from a physical machine cluster to a private cloud using virtualization techniques (Wu et al. 2014). CiteSeerX extensively leverages open source software, which significantly reduces development effort: Red Hat Enterprise Linux (RHEL) 5 and 6 are the operating systems for all servers; Tomcat 7 is used for web service deployment on web and indexing servers; MySQL is used as the database management system to store metadata; Apache Solr is used for the index; and the Spring framework is used in the web application. In this section, we highlight four AI solutions that are leveraged by CiteSeerX and that tackle different challenges in the metadata extraction and ingestion modules (tagged by C, E, D, and A in figure 1). While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and search engines.
CiteSeerX: AI in a Digital Library Search Engine
Wu, Jian (The Pennsylvania State University) | Williams, Kyle (The Pennsylvania State University) | Chen, Hung-Hsuan (The Pennsylvania State University) | Khabsa, Madian (The Pennsylvania State University) | Caragea, Cornelia (University of North Texas) | Ororbia, Alexander (The Pennsylvania State University) | Jordan, Douglas (The Pennsylvania State University) | Giles, C. Lee (The Pennsylvania State University)
CiteSeerX is a digital library search engine that provides access to more than 4 million academic documents with nearly a million users and millions of hits per day. Artificial intelligence (AI) technologies are used in many components of CiteSeerX, e.g. to accurately extract metadata, intelligently crawl the web, and ingest documents. We present key AI technologies used in the following components: document classification and deduplication, document and citation clustering, automatic metadata extraction and indexing, and author disambiguation. These AI technologies have been developed by CiteSeerX group members over the past 5–6 years. We also show the usage status, payoff, development challenges, main design concepts, and deployment and maintenance requirements. While it is challenging to rebuild a system like CiteSeerX from scratch, many of these AI technologies are transferable to other digital libraries and/or search engines.